
docs(README): remove degenerate DFlash perf row from #85 perf table #88

Merged

solderzzc merged 1 commit into SharpAI:main from ericjlake:fix/readme-dflash-row-cleanup on Apr 26, 2026

Conversation

@ericjlake (Contributor)

Follow-up to #85, which merged with a Qwen3-A3B perf table that included a `--dflash` row showing 70 tok/s on medium/long prompts. Subsequent benchmarking found that headline number was always produced by degenerate output ("and and and...", "**UMA** **UMA**...", etc.), with the longest run of identical tokens reaching 488 in a row.

Root cause

`DFlashRuntime.greedyTokensWithMask` uses argMax (pure greedy) for both draft and verify, regardless of the request's temperature. Vanilla SwiftLM samples stochastically at temp=0.6, which breaks ties between high-probability tokens; DFlash's pure greedy decoding has no tie-breaker and locks into low-entropy attractors. Once locked, draft and target both keep predicting the same connective ("and", "UMA", etc.), all 16 positions of every verify pass commit, and the loop self-reinforces: high acceptance and high throughput, but unusable output.
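For intuition, the contrast looks roughly like this (a minimal Swift sketch; these helpers are illustrative, not the actual SwiftLM/DFlash API):

```swift
import Foundation

// Illustrative sketch only; not the actual SwiftLM/DFlash API.

// What DFlash effectively does today: pure greedy. argMax is deterministic,
// so once an attractor token has the top logit, it wins every step.
func greedyToken(logits: [Float]) -> Int {
    logits.indices.max(by: { logits[$0] < logits[$1] })!
}

// What vanilla decoding does at temp=0.6: sample from the softmax.
// The randomness breaks ties between high-probability tokens and
// occasionally escapes an attractor before a loop can lock in.
func sampledToken(logits: [Float], temperature: Float = 0.6) -> Int {
    let scaled = logits.map { $0 / temperature }
    let maxLogit = scaled.max()!
    let exps = scaled.map { expf($0 - maxLogit) }   // numerically stable softmax
    var r = Float.random(in: 0..<exps.reduce(0, +))
    for (i, e) in exps.enumerated() {
        r -= e
        if r < 0 { return i }
    }
    return exps.count - 1
}
```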

Why we didn't catch it earlier

  • The server-side `[SwiftLM] DFlash summary: ... 70.3 tok/s` log line reports throughput, not quality.
  • High acceptance is consistent with degenerate output: target and draft both lock onto the same predictable token.
  • The snippets visible at the tail of summary log lines ("11", "Let's") were the last few tokens of repetitive runs, not clean prose.

Vanilla generation (no DFlash) on the same 5 prompts: clean output, 60.4 tok/s avg, uniqueness ratios 0.60–0.84.
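The uniqueness-ratio and max-run checks referenced here (and in the test plan below) are simple token-level heuristics. A minimal sketch; the thresholds are illustrative choices, not values from the codebase:

```swift
// Degeneracy heuristics for vetting benchmark output.
// Thresholds below are illustrative, not from the codebase.

func uniquenessRatio(_ tokens: [String]) -> Double {
    guard !tokens.isEmpty else { return 0 }
    return Double(Set(tokens).count) / Double(tokens.count)
}

func longestIdenticalRun(_ tokens: [String]) -> Int {
    var best = 0, current = 0
    var previous: String? = nil
    for token in tokens {
        current = (token == previous) ? current + 1 : 1
        best = max(best, current)
        previous = token
    }
    return best
}

// Clean vanilla runs scored 0.60-0.84 uniqueness; the degenerate DFlash
// runs hit identical-token runs up to 488 long, so even loose thresholds
// separate the two cleanly.
func looksDegenerate(_ tokens: [String]) -> Bool {
    uniquenessRatio(tokens) < 0.3 || longestIdenticalRun(tokens) > 32
}
```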

Mitigation attempts

We added a standard repetition penalty (mirroring `MLXLMCommon.RepetitionContext`) inside `DFlashRuntime.greedyTokensWithMask`, with a 64-token ring buffer. Results across 5 diverse prompts:

| Penalty   | Clean outputs | Best clean t/s | Notes |
| --------- | ------------- | -------------- | ----- |
| 1.0 (off) | 0/5           | n/a            | all degenerate |
| 1.1       | 1/5           | 37             | most still degenerate; longest run down from 488, but still 244 on some prompts |
| 1.3       | 1/5           | 15             | fixes more attractors, but acceptance crashes from 80% to 18-46% and throughput drops below vanilla |

Rep penalty is the wrong tool. At 1.1 it is too weak to dislodge attractors (it demotes the logit by only ~9%, while the attractor gap is often 10+ logit points). At 1.3 it is strong enough to break loops, but it also makes the target reject draft picks whenever the draft was greedy on a token the target wants to slightly demote; DFlash's strict `==` accept check then commits only up to the first mismatch, killing the speedup.
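For reference, the mitigation was the standard divide-positive/multiply-negative penalty over a sliding window of recent tokens. A simplified sketch (illustrative, not the actual `DFlashRuntime` code):

```swift
// Simplified sketch of the repetition-penalty mitigation.
// Illustrative only; not the actual DFlashRuntime implementation.
struct RepetitionPenalty {
    let penalty: Float          // 1.0 = off; 1.1 and 1.3 tested above
    let windowSize = 64         // ring buffer of recent token ids
    private var recent: [Int] = []

    mutating func record(_ token: Int) {
        recent.append(token)
        if recent.count > windowSize { recent.removeFirst() }
    }

    // Token ids are assumed to index directly into the logits array.
    func apply(to logits: inout [Float]) {
        for token in Set(recent) {
            // Standard penalty: divide positive logits, multiply negative
            // ones, so a recent token is always demoted. At penalty = 1.1
            // a positive logit shrinks by only ~9% (1/1.1), which is why
            // it cannot close a 10+ point attractor gap.
            let l = logits[token]
            logits[token] = l > 0 ? l / penalty : l * penalty
        }
    }
}
```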

The proper fix is in DFlash itself

This is the same root cause as the 122B SSD-stream finding tracked at z-lab/dflash#91 (acceptanceLen=0|1 → I/O fan-out kills throughput). Both reduce to: DFlash's argmax-greedy verify path can't tolerate sampler-controlled diversity on the target side.

The proper fix is stochastic posterior sampling with rejection-based accept (the Leviathan/Chen formulation): the target samples from the softmax at temperature T, and a draft-proposed token `d` is accepted iff `r ~ U(0,1) < min(1, p_target(d) / p_draft(d))`. This preserves the target distribution and converts the rigid `==` accept into a probabilistic check that doesn't fall off a cliff on small disagreements. That's a DFlash architecture change, tracked upstream.
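Concretely, the accept step in that formulation looks like the sketch below (illustrative, not DFlash code; on rejection, the full algorithm also resamples from the normalized residual `max(0, p_target - p_draft)` to preserve the target distribution):

```swift
// Sketch of the Leviathan/Chen rejection-based accept (not DFlash code).
// pTarget and pDraft are the two models' softmax distributions at this
// position; d is the token the draft proposed.
func acceptsDraftToken(_ d: Int, pTarget: [Float], pDraft: [Float]) -> Bool {
    // Accept with probability min(1, p_target(d) / p_draft(d)).
    // A token the target merely demotes a little now loses a little
    // accept probability, instead of failing a strict == comparison.
    let ratio = pTarget[d] / max(pDraft[d], 1e-9)
    return Float.random(in: 0..<1) < min(1, ratio)
}
```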

This PR

Replaces the misleading `--dflash` perf row with a clear warning, so users don't adopt a degenerate codepath as the recommended config. Vanilla's 60.4 tok/s remains the honest production number for now.

The `--dflash` flag itself stays in place (no code changes); the issue is the config recommendation, not the implementation. Once the upstream fix lands, we can re-add the row with verified-clean numbers.

Test plan

  • No code changes; README only.
  • Re-verified the same vanilla benchmark numbers (61.7 / 62.3 / 62.1 tok/s) and confirmed clean output via uniqueness-ratio and max-run checks.

References

Follow-up to SharpAI#85 (just merged), whose perf table introduced the degenerate `--dflash` row. The upstream fix (stochastic posterior sampling with rejection-based accept, Leviathan/Chen) is tracked at z-lab/dflash#91.
See z-lab/dflash#91 (issuecomment 4322584783) for the full diagnosis.
@solderzzc solderzzc merged commit b11e61e into SharpAI:main Apr 26, 2026
11 checks passed